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ASSESSING THE PROTECTION PROVIDED BY 
MISCLASSIFICATION-BASED DISCLOSURE LIMITATION 
METHODS FOR SURVEY MICRODATA 

By Natalie Shlomo and Chris Skinner 

University of Southampton 

Government statistical agencies often apply statistical disclosure 
limitation techniques to survey microdata to protect the confiden- 
tiality of respondents. There is a need for valid and practical ways 
to assess the protection provided. This paper develops some simple 
methods for disclosure limitation techniques which perturb the val- 
ues of categorical identifying variables. The methods are applied in 
numerical experiments based upon census data from the United King- 
dom which are subject to two perturbation techniques: data swapping 
(random and targeted) and the post randomization method. Some 
simplifying approximations to the measure of risk are found to work 
well in capturing the impacts of these techniques. These approxima- 
tions provide simple extensions of existing risk assessment methods 
based upon Poisson log-linear models. A numerical experiment is also 
undertaken to assess the impact of multivariate misclassification with 
an increasing number of identifying variables. It is found that the mis- 
classification dominates the usual monotone increasing relationship 
between this number and risk so that the risk eventually declines, 
implying less sensitivity of risk to choice of identifying variables. The 
methods developed in this paper may also be used to obtain more 
realistic assessments of risk which take account of the kinds of mea- 
surement and other nonsampling errors commonly arising in surveys. 

1. Introduction. Government statistical agencies have statutory and eth- 
ical obligations to protect the confidentiality of the data they collect. At the 
same time, their core mission is to ensure that these data are used effectively 
for statistical purposes. Tensions between these two objectives may arise, in 
particular, when access to microdata on individuals or establishments is to 
be provided to researchers, so that they may conduct their own analyses of 
social or economic phenomena. Although microdata may be anonymized by 
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removing obvious identifying information such as name and address without 
damage to the statistical analyses, such anonymization will rarely be con- 
sidered sufficient for confidentiality protection, since the rich socio-economic 
information in the microdata may often enable records to be identified by 
matching to another data source on known individuals or establishments. 
Agencies have therefore developed a number of ways of protecting confiden- 
tiality in this context. One common approach is to modify the microdata file 
by applying a statistical disclosure limitation (SDL) method, such as recod- 
ing or data perturbation, to those variables judged potentially identifying 
[Federal Committee on Statistical Methodology (2005)]. Such modification 
can, however, seriously reduce the utility of the microdata and it is there- 
fore important for the agency to be able to assess the protection provided 
by such methods in order to be able to make judgements about the degree 
of modification to apply. 

The aim of this paper is to develop methodology to assess the disclosure 
protection provided by the misclassification of one or more categorical iden- 
tifying variables. Misclassification is supposed here to arise in one of two 
ways. First, it may be the result of the deliberate application by the agency 
of an SDL method, specifically we consider the methods of data swapping 
[Dalenius and Reiss (1982)] and post-randomization or PRAM [Gouweleeuw 
et al. (1998)]. This paper is motivated by experience of the use of such meth- 
ods at government statistical agencies (especially in the United Kingdom) 
with microdata from social surveys on individuals or from population cen- 
suses. In these cases, the potential identifying variables which might be used 
for matching are invariably categorical. A second way in which misclassifi- 
cation may arise is as a result of measurement error which arises naturally 
in surveys and takes the form of misclassification for categorical variables 
[Kuha and Skinner (1997)]. In this case, we shall suppose that the agency 
has some information about the nature of the misclassification mechanism. 

In the current practice of statistical agencies, when the disclosure protec- 
tion of such methods is assessed, it is usually based upon simple measures, 
such as functions of the diagonal elements of the misclassification matrix 
[Willenborg and De Waal (2001), page 119], or a simple estimated probabil- 
ity that an apparent match is correct [Gouweleeuw et al. (1998)], or via the 
outcome of a record linkage experiment (see below). Reiter (2005) developed 
a more sophisticated approach by defining a measure of identification risk, 
based upon the modeling framework of Duncan and Lambert (1989), and 
showing how it could be assessed before and after the application of a num- 
ber of SDL methods, including data swapping. This focus on identification 
risk is often appropriate in government contexts, where judgments about 
protection are informed by legislation or codes of practice which express 
threats to confidentiality in terms of individual respondents being identi- 
fied. However, the need to model a very wide range of microdata variables 
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and relationships in Reiter's (2005) approach may hmit its apphcation in 
practice. In this paper we develop an approach which is based on a similar 
framework to Reiter (2005), but which retains some of the simplicity of the 
former methods. We achieve simplification by restricting the information set 
upon which the risk measure is conditioned, extending the approach of Skin- 
ner and Shlomo (2008). Our approach also extends Reiter (2005) by taking 
fuller account of the protection achieved from sampling. 

Assessing identification risk using record linkage experiments [e.g., Yancy, 
Winkler and Creecy (2002); Domingo-Ferrer and Torra (2001)] is natural 
given the threat that such methods pose [Fienberg (2006)]. The experiment 
typically involves matching records in the microdata file, masked by an SDL 
method, to records in the original unmasked file. The risk is often defined as 
the proportion of such matches which are correct [Spruill (1982)]. A problem 
with this approach is that it makes an unjustified assumption that a hypo- 
thetical intruder has access to data that are as good as the original data and 
may not take account of the disclosure protection provided by sampling. We 
shall show in the Appendix that our proposed approach to assessing iden- 
tification risk in the case of exact matching does, in fact, provide a closed 
form expression for the correct match proportion which would be estimated 
by an experiment using a form of probabilistic record linkage proposed by 
Fellegi and Sunter (1969). Record linkage experiments have the potential 
to capture the impact of a wider range of types of potential attack, includ- 
ing those that make explicit allowance for data masking and exploit greater 
computational power [Winkler (2004)], but consideration of such extensions 
is beyond the scope of this paper. 

Statistical modeling approaches to identification risk assessment have 
been proposed by a number of authors [e.g., Paass (1988); Duncan and 
Lambert (1989); Fuller (1993)]. It is generally assumed that an intruder 
seeks to identify an individual in the microdata by matching records to 
known individuals in the population using identifying variables, also called 
key variables, values of which are known both for the microdata records and 
for the known individuals. This paper builds on the literature which has 
used models for categorical key variables as a basis for assessing disclosure 
risk. Bethlehem, Keller and Pannekoek (1990) is a seminal contribution. 
We follow especially Skinner and Shlomo (2008), who considered the use of 
log-linear models to assess identification risk. Their work did not, however, 
consider the impact of SDL methods on risk, other than the recoding of key 
variables. 

The empirical work in this paper is based upon the 2001 population census 
in Great Britain, which will be used to provide population data to validate 
risk assessments for samples, viewed as representing potential sample sur- 
veys. Our focus will be on the impact of SDL methods on identification risk. 
The effects of these methods on the utility of potential data analyses is also 
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vitally important and we provide some information loss measures to analyze 
and compare the perturbation methods. 

Our paper is organized as follows. Measures of identification risk in the 
presence of misclassification are developed in Section 2. Since these measures 
depend upon population quantities which may be unknown, methods of es- 
timating these measures using sample data are considered in Section 3. Ap- 
plications using census data are presented in Section 4 for a random and tar- 
geted data swapping method and a random and targeted post-randomization 
method (PRAM). A further numerical illustration with multivariate misclas- 
sification is presented in Section 5. Section 6 contains a concluding discus- 
sion. 

2. Identification risk under misclassification. Consider the release of a 
microdata file consisting of records for a sample s = {1, 2, . . . , n} drawn from 
a finite population U of size A^. We suppose an intruder seeks to match 
a known target unit in [/ to a record in the file using C categorical key 
variables , . . . ,X'-'' . We assume the agency knows the intruder's choice 
of key variables. Possible departures from this assumption are discussed 
in Section 6. The variable formed by cross-classifying the key variables, as 
measured by the intruder on the target unit, is denoted X and its values 
are labeled 1,2,. . . ,K. The value of X recorded in the microdata, after the 
application of the SDL method (and natural measurement error), is denoted 
X. We treat the values of X for population units as fixed and suppose the 
values of X for the records in the microdata are determined independently 
by a misclassification matrix M, where 

(2.1) Fv{X = j\X = k) = Mjk. 

To assess the disclosure protection provided by misclassification, we imag- 
ine that the intruder observes a match between a specific sample unit A and a 
target population unit B, that is, observes Xa = Xb (where Xa is the value 
of X for unit A and Xb is the value of X for unit B), and measures disclosure 
risk in terms of the uncertainty as to whether A = B. A simple ad hoc mea- 
sure of this uncertainty is given by Mjj (or 1 — Mjj), where j is the common 
value of Xa and Xb- Willenborg and De Waal [(2001), page 121] propose 
that the agency specifies upper bounds for these diagonal elements of M ac- 
cording to the level of protection required. Following Reiter (2005), we define 
the identification risk as Pr(^ = B\data), where the values Xa and Xb are 
implicitly included in the data and the nature of the probability mechanism 
will be clarified later. A simplified approach to estimating this risk is given by 
Gouweleeuw et al. (1998), who make the very conservative assumption that 
the intruder knows that B is in the sample and approximate Pr(yl = B\data) 
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by Pr(XA = j\XA = j) = Mjj Fv{Xa = j)/ Ek Mjk Pr(XA = A:), which they 
estimate by 

(2.2) M,jfj/Y,Mjkfk, 

k 

where fk is the number of units in s for which X = k (they in fact use the 
odds rather than the probabihty). In contrast to the highly simphfying as- 
sumptions of Gouweleeuw et al. (1998), Reiter (2005) aUows for considerable 
generality by adopting a very wide definition of data in Pr(A = B\data)^ so 
that it may include all the values of Xi in the sample as well as the values 
of any other microdata variables. This creates not only a major modeling 
task to assess the probability of interest, but also the possibility that this 
probability will be sensitive to the specification of the model. 

We seek an intermediate position, avoiding the very conservative assump- 
tion that the intruder knows that B is in the sample, but reducing the scope 
of data in Y'i[A = B\data) to avoid the complex modeling issues. We define 
the matching variable Zj to be 1 if = Xb and otherwise and we take 
the data to consist of the values for i G s. We suggest that this is the 
critical information to consider when assessing the probability that an ob- 
served match is correct. We shall also restrict our attention further to the 
case when a unique sample unit matches B (so = 1 and Zi = ii i ^ a 
for some unit a € s). This is the worst case and thus of most interest, that 
is, the risk will be lower if B matches more than one sample unit. In this 
case, we obtain the following expression for the identification risk: 

Identification risk = Pi(A = B\Zi, . . . , Zn) 

(2.3) 

= PriEB)/Y,Fr{Ea), 

where Ea is the event that population unit a is sampled and its value Xa 
matches Xb and that no other population unit is both sampled and has a 
value of X which matches Xb- In order to allow for the effect of unequal 
probability sampling and the potential use of sampling weights, we suppose 
that units in the population U are selected independently into the sample s 
with inclusion probabilities vTj which may depend on the value X = j for the 
unit. Writing Xa = k and Xb = j and using our previous assumptions about 
the misclassification mechanism, we obtain Pr(£^a) = ajMj}i/{l — njMjk), 
where Oj = iTj Yiii^ ~ '^j^ji)^' the number of units in the popula- 

tion with X = j. Hence, 

Fr{A = B\Zi,...,Zn) 

(2.4) 

= [Mn/{l-7TjM,j)]^ 



^FkMjk/il-TTjM. 



jkj 
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This expression assumes the intruder does not know whether B £ s. If this 
event was known to arise and was included in the conditioning set, (2.4) 
should be modified by setting ttj = 1 and replacing Fk by fk- This produces 
an expression that is similar to that given earlier in (2.2) from Gouweleeuw et 
al. (1998) but makes fewer approximations. For expression (2.4), the identi- 
fication risk also assumes that the are part of the data, that is, known. In 
practice, this will often not be the case, as discussed by Skinner and Shlomo 
(2008), and it will be necessary to integrate the F^ out of this expression as 
will be discussed in Section 3. It follows from (2.4) that 



with equality holding if there is no misclassification. The extent to which 
the left-hand side of this inequality is less than the right-hand side measures 
the impact of misclassification on disclosure risk. 

If the inclusion probabilities ttj are all small, we may approximate (2.4) 



^ k ^ 

Moreover, if the population size is large, we have approximately F^Mjk ~ 
Fj, where Fj is the number of units in the population which would have 
X = j ii they were included in the microdata (with misclassification). Hence, 
a simple approximate expression for the risk, natural for many social sur- 
veys, is 



An alternative derivation of this result is provided in the Appendix un- 
der the assumption that the intruder adopts the probabilistic record linkage 
approach of Fellegi and Sunter (1969), making a link if the match variable 
Za = 1- The identification risk is the probability that the match is correct 
and the above approximation is obtained if the probability is defined with 
respect to the sampling scheme, the misclassification mechanism and a ran- 
dom selection of a pair for matching as in Fellegi and Sunter (1969). 

Another approximation to expression (2.4) is obtained by assuming the 
misclassification is small, say, Mjj = (1 — 6)(j)jj and = 6(j)jk {j 7^ k), 
where the (j) are fixed and — )• 0. In this case, we have 



Pi{A = B\Zi,...,Zn)<l/Fj 



by 




(2.5) 



Ft{A = B\Z: 



Zn)=M,j/F,. 



(2.6) 



Pt{A = B\Z 



■ 5 Zn) 

[Fj - FjM,,]/[F,Ad,,/{l - ^,M,,)]) 




or 



(2.7) 



PT{A = B\Zu...,Zn) 

« [M^/il - 7r,M,,)]/[(F,^,M2.)/(l - ttM,,) + F,]. 
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Note that none of approximations (2.5), (2.6) or (2.7) depend upon Mjk 
for j ^ k and so knowledge of these probabihties is not required in the 
estimation of risk. 

The definition of risk in (2.3) apphes to a specific record. Agencies wih 
also usually wish to consider aggregate measures to enable them to make 
judgements about the whole file. Following Skinner and Shlomo (2008), we 
define an aggregate measure as the sum of the record-level measures in (2.4) 
across sample unique records: 



where SU is the set of key variable values which are sample unique. This 
measure may be interpreted as the expected number of correct matches 
among sample uniques. For some purposes, an agency might find this mea- 
sure easier to interpret if it is transformed into a measure with an upper 
bound, such as by dividing by the number of sample uniques to obtain a 
proportion. However, we shall stick with the untransformed r as a mea- 
sure of the total number of units, for example, individuals, threatened with 
identification. 

We also consider, for comparison, a related measure which could be used 
if the misclassification status of microdata records is known. Let SUCC 
denote the set of key variable values which are sample unique and where 
these sample unique values have been correctly classified. The measure is 
given by 



and again may be interpreted as the expected number of correct matches 
among sample uniques. We also define r* as the corresponding measure of 
risk in the absence of perturbation, that is, the sum of ^/ Fj across key values 
which are unique in the sample with respect to X. 

3. Risk estimation. An agency wishing to apply an SDC method to sur- 
vey microdata will generally not know the values of Fj or Fj appearing 
in the risk expressions. We do suppose that the values of Mj^ are known. 
Skinner and Shlomo (2008) discuss the estimation of risk in the absence of 
misclassification based on a Poisson log-linear model. In this case, expression 
(2.4) reduces to l/Fj and their broad approach is to define the risk as the 
conditional expectation of this quantity given the observed data and to es- 
timate this expectation using data for the sample counts fj,j = l,2,...,K, 
for which a log-linear model is fitted. Expression (2.5) provides a simple way 
to extend their approach to misclassification provided Mjj is known. Since 



(2.8) 



T = 




(2.9) 




j&SUCC 
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the fj, j = 1,2, . . . , K , represent the available data, all that is required is 
to ignore the misclassification and estimate the expectation of 1/Fj given 
the data from the /j, j = 1, 2, . . . , K, as in Skinner and Shlomo (2008), that 
is, by fitting a log-linear model now to the fj, j = 1,2, . . . , K, following the 
same criteria as before. This results in an estimate E{l/Fj\fj = 1) based 
on the assumptions of the Poisson distribution for the population and sam- 
ple counts. These estimates should be multiplied by the Mjj values and 
summed if aggregate measures of the form in (2.8) are needed. It would ap- 
pear to be rather more complex to estimate the expressions including terms 
in Fj. In the presence of complex sampling, the estimation method may be 
adapted using the method of pseudo maximum likelihood estimation [Rao 
and Thomas (2003)] by incorporating survey weights in the estimation as 
discussed by Skinner and Shlomo (2008). 

4. Application of perturbative disclosure limitation techniques. In this 
section we consider two specific perturbative SDL techniques used at statis- 
tical agencies: data swapping and the post-randomization method (PRAM). 
Both techniques introduce misclassification of the key variables to lower the 
probabilities of identifying individuals. We present examples of how to assess 
the impact of these techniques on identification risk. Since the misclassifi- 
cation is under the control of the statistical agency, the misclassification 
matrix M is known. 

4.1. Data swapping. The method of data swapping is based on exchang- 
ing the values of one or more key variables between pairs of records. In 
order to minimize bias, the pairs of records are typically selected within 
strata defined by control variables, such as the age and sex of the individ- 
ual. In addition, the perturbation can be targeted to high-risk records that 
are more likely to be population uniques, for example, on rare ethnicities. It 
is common that geographic variables are swapped between records for the 
following reasons: 

• For given values of the control variables, the sensitive variables are likely 
to be relatively independent of geography and, therefore, it is expected 
that less bias will occur. In addition, swapping geography will not nor- 
mally result in inconsistent and illogical records. By contrast, swapping 
a variable such as age would result in many inconsistencies with other 
variables, such as marital status and education. 

• At a higher geographical level and within control strata, the marginal 
distributions are preserved. 

• The level of protection increases by swapping variables which are highly 
"matchable" such as geography. 
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For this experiment, we carry out a simple data swapping procedure where 
the geography variable of Local Authority District (LAD) is exchanged be- 
tween a pair of individuals. The population includes N = 1,468,255 individ- 
uals from an extract of the 2001 United Kingdom (UK) Census. We drew 
1% Bernoulli samples (n = 14,683) and define six key variables for the risk 
assessment: Local Authority (LAD) (11), sex (2), age groups (24), marital 
status (6), ethnicity (17), economic activity (10), where the numbers of cat- 
egories of each variable are in parentheses {K = 538,560). We implement a 
random data swap by drawing a sub-sample of 10% and 20% in each of the 
LADs. The remaining individuals are not perturbed. On the sub-samples in 
each LAD, half of the individuals are flagged. For each flagged individual, 
an unflagged individual is randomly chosen within the sub-sample and their 
LAD variables swapped, on condition that the individual chosen was not 
previously selected for swapping and that the two individuals do not have 
the same LAD, that is, no individual is selected twice for producing a pair. 
We also implemented a 10% and 20% targeted data swap where the LAD 
variable is swapped separately within two groups defined by "White British" 
and "Other" ethnicities. For the 20% swap, LADs were swapped randomly 
between all pairs of individuals in the "Other" group and a small percentage 
(7%) of individuals in the "White British" group. This swapping rate was 
chosen so that the total percentage of swapped individuals would be 20% 
as in the random data swapping. For the 10% swap, LADs were swapped 
randomly from among the "Other" group that compose 10% of the total 
individuals in the sample. 

The misclassification matrix M for the data swapping designs can be 
expressed simply in terms of the 11 x 11 misclassification matrix, denoted 
M3 = [Mj^], for the LAD variable g: 

For the random swap: 

• On the diagonal: M|^. = 0.9 or M|j. = 0.8 for the 10% and 20% swaps 
respectively. 

• Off the diagonal: = 0.1 x Uk/iEi^j "0 = 0.2 x Uk/iT^i^^ 
where is the number of records in the sample in LAD fc, A; = l,2,...,ll, 
for the 10% and 20% swaps respectively. 

For the targeted swap on the 10% swap, the values AfJ^ for the "Other" 
ethnicity are calculated as follows: 

• On the diagonal: Mjj = 0.25. 

• Off the diagonal: iWj^ = 0.75 x n2k/{^i^jn2i), where n2k is the number 
of records in the sample with "Other" ethnicity in LAD A;, A; = l,2,...,ll. 

For the targeted swap on the 20% swap, the misclassification matrix M is 
defined separately according to the "White British" and "Other" ethnicities 
as follows: 
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• On the diagonah M^-- = 0.93. 

• Off the diagonal: Mj^ = 0.07 x '^i/), where riik is the number 
of records in the sample with "White British" ethnicity in LAD k, k = 
1,2,. ..,11. 

The values Mf, for the "Other" ethnicity are calculated as follows: 

• On the diagonal: Mjj = 0. 

• Off the diagonal: Mj^ = 1 x n2k/{^i^jn2i), where is the number of 
records in the sample with "Other" ethnicity in LAD k, k = 1,2, ... ,11. 

4.2. The post-randomization method (PRAM). A more direct method 
that is used for exchanging values of categorical variables is PRAM. For this 
method, values of categories in a given record are changed or not changed 
stochastically according to a misclassification matrix. This matrix is chosen 
to preserve expected marginal frequencies of the variables. Let be the row 
vector of sample frequencies of the different categories of key variable X'^ and 
pC _ j'^ jnfi bg tiie corresponding vector of sample proportions, where n is the 
sample size. For each record, the category of X'^ is changed or not changed 
according to the probabilities in the misclassification matrix M'^ . Let be 
the row vector of perturbed frequencies. Then E{f''\f^) = f'^M'^, where the 
expectation is with respect to the misclassification mechanism. The matrix 
may be expected to be nonsingular since small perturbation rates should 
imply that it is "close to" diagonal. The inverse M'^ can be used to obtain 
an unbiased estimator of the original frequency vector: = f^M^ . In 
addition, we can place the condition of invariance on the matrix M'^, that 
is, f^M^ = f^, and preserve the expected marginal frequencies. This releases 
the users of the perturbed file of the extra effort to obtain unbiased moment 
estimates of the original data, since itself will be an unbiased estimate of 

r- 

To obtain an invariant transition matrix, the following two-stage algo- 
rithm is applied [see Willenborg and De Waal (2001)]. Let M"^ be the 
misclassification matrix: M?^ = Pr(X'^ = k\X'^ = j), where j represents the 
original category and k the perturbed category. Now calculate the matrix 
Q using the Bayes formula by Qlj = Pr(X^ = jlX"" = k) = Mj=^Pr(X'= = 

j) /\^iMf^Y'T[X^ = /)]. We estimate the entries of this matrix by Q^ - = 
^jkPj/l'^l ^ikPl]^ where p'j is the sample proportion in category j. The ma- 
trix R'^ = M'^Q'^ is invariant, that is, p'^R'^ = p^, since Rfj = J2k\Pj^ik^jk/ 
Mij^pf] and PiRfj = Xlfc Pj^fk ~ Pj - '^^^ vector of the original propor- 
tions p^ is the eigenvector of R. In practice, Q'^ can be calculated by trans- 
posing matrix M^, multiplying each column j by pj and then normalizing its 

rows so that the sum of each row equals one. We define R"^ = aR^ + (1 ~ ot)I^ 
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where / is the identity matrix of the appropriate size. is also invariant 
and the amount of misclassification is controlled by the value of a. 

We conduct a second experiment using the same data and setup described 
in Section 4.1 and PRAM to perturb the geographical variable LAD. For the 
random perturbation, an 11 x 11 misclassification matrix is defined for 
the 11 categories of LAD where the diagonal elements are 0.9 and 0.8 and 
the off-diagonal elements are equal to a probability of 0.1 and 0.2 for the 10% 
and 20% perturbation respectively. The invariant misclassification matrix is 
calculated with a = 0.55. For each individual, a random uniform number 
between and 1 is generated and the category of the LAD changed (or not 
changed) if it is in the interval defined by the cumulative probability. For 
the 10% targeted perturbation, we define the misclassification matrix for the 
"Other" ethnicities with 0.25 on the diagonal and 0.75 on the off-diagonals 
and the invariant parameter a = 0.85. For the 20% targeted perturbation, 
we define the misclassification matrix for the "Other" ethnicities with on 
the diagonal and 1 on the off-diagonals, and the misclassification matrix for 
the "White British" ethnicity with 0.93 on the diagonal and 0.07 on the 
off-diagonals. For both matrices, the invariant parameter is q = 1. 

4.3. Results of disclosure risk assessment. Since we know the misclassi- 
fication matrix M and the true population counts Fj in these experiments, 
we can assess the performance of expressions (2.5)-(2.7) as approximations 
to (2.4). We do this by summing all the expressions across sample unique 
records, as in the aggregate risk measure r in (2.8) and comparing the re- 
sulting sums. We also compare these measures to the measure in (2.2) of 
Gouweleeuw et al. (1998). In addition, we consider the more practical sit- 
uation when neither the Fj nor the Fj are known to the agency, all that 
is observed is the "misclassified" sample and the matrix M. In this case, 
we carry out the risk estimation as described in Section 3 through the use 
of the Poisson log-linear model on the sample counts fj. The log-linear 
model was chosen using a forward search algorithm and the outcome of 
goodness of fit statistics as developed in Skinner and Shlomo (2008). We 
calculate the naive estimated risk measure obtained from the log-linear 
model on the misclassified sample and the adjusted estimated risk mea- 
sure, taking into account the misclassification matrix. The experiments were 
repeated under different samples and each perturbation method applied in- 
dependently and we found that all of the experiments produced similar 
results. Table 1 presents results of one of the simulation experiments for 
each of the perturbation methods: random and targeted data swapping and 
PRAM. 

The estimates presented in Table 1 for the risk of identification are similar 
for random data swapping and PRAM. Misclassification reduces the risk in 
the file from about r* = 360 to about T%n = 290 for the 20% perturbation 
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Table 1 

Identification risk estimates for microdata samples generated from UK 2001 Census 
subject to perturbative SDL methods — Risk measure r* no misclassification = 358.1 



SDL method 



Random Targeted 



Identification risk measures 


Swap 


PRAM 


Swap 


PRAM 




10% perturbation 






Identification risk measures 


for perturbed data 


with known population counts 


Risk measure r in (2.8) 


321.6 


325.8 


146.3 


161.6 


Approximation in (2.5) 


321.4 


325.5 


146.2 


161.4 


Approximation in (2.6) 


317.7 


321.7 


144.8 


159.8 


Approximation in (2.7) 


321.6 


325.6 


146.3 


161.6 


Risk measure tqq in (2.9) 


316.6 


318.2 


149.5 


160.3 


Estimated risk measures based 


on sample 


data 




Risk measure in (2.2) 


2486.7 


2489.1 


1749.1 


1899.3 


Naive risk measure 










(Poisson log-linear model 










on misclassified sample) 


343.2 


347.6 


297.2 


285.4 


Estimated risk measure 










(Poisson log-linear model 










adjusted for misclassification) 


308.8 


312.7 


142.7 


157.9 




20% perturbation 






Identification risk measures 


for perturbed data 


with known population counts 


Risk measure r in (2.8) 


298.9 


299.7 


82.2 


133.8 


Approximation in (2.5) 


298.4 


299.3 


82.1 


133.7 


Approximation in (2.6) 


280.4 


283.5 


81.7 


132.7 


Approximation in (2.7) 


298.9 


299.8 


82.2 


133.8 


Risk measure tqq in (2.9) 


292.8 


292.2 


85.0 


133.4 


Estimated risk measures based 


on sample 


data 




Risk measure in (2.2) 


2264.0 


2311.7 


1419.8 


1688.2 


Naive risk measure 










(Poisson log-linear model 










on misclassified sample) 


358.6 


349.5 


262.5 


285.2 


Estimated risk measure 










(Poisson log-linear model 










adjusted for misclassification) 


286.8 


283.1 


90.3 


133.2 



and = 320 for the 10% perturbation for those methods. The measure t* 
is interpreted as the expected number of correct matches which an intruder 
would make if matches were attempted with all sample unique records. The 
decrease in this measure from 360 to 290 as a result of misclassification is 
modest since a large number of records remain unchanged. An alternative 
interpretation of r could be obtained by dividing by the number of sample 
uniques to give the proportion of sample uniques which would be expected 
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to be identified correctly. This proportion ranges in Table 1 between 0.116 
for the 10% Random Swap, 0.053 for the 10% Targeted Swap, 0.108 for the 
20% Random Swap and 0.030 for the 20% Targeted Swap. 

The identification risk is reduced considerably with the targeted data 
swapping since many more sample uniques are perturbed. The misclassifi- 
cation is reduced from about r* = 360 to about Tq^ = 85 for data swapping 
and T^(j = 130 for PRAM for the 20% perturbation and to about t^^ = 150 
for data swapping and = 160 for PRAM for the 10% perturbation. The 
three approximations to the risk measure in (2.8) all provide good results, 
although the approximation in (2.6) slightly underestimates. The measure 
in (2.8) relies on knowledge of both the full misclassification matrix M and 
the population counts Fj. In contrast, the approximations (2.5), (2.6) and 
(2.7) only require knowledge of the probability of not misclassifying a record, 
that is, the probabilities on the diagonals. The alternative risk measure t'^q 
in (2.9) also turns out to behave similarly to (2.8). The value of the mea- 
sure in (2.2) of Gouweleeuw et al. (1998) is much higher than the values 
of the other measures, reflecting the very conservative assumption that the 
intruder knows that the target unit is in the microdata sample. In practice, 
the population counts will generally be unknown to the statistical agency 
(and the intruder) for survey data. We therefore consider the method in 
Section 3 based upon the Poisson log-linear model. The estimated aggregate 
risk measures appear to perform well with estimates for the risk measure 
under misclassification of about 285 for random data swapping and PRAM 
under the 20% perturbation and about 310 for random data swapping and 
PRAM under the 10% perturbation. The estimated aggregate risk measures 
are about 140 for targeted data swapping and 160 for targeted PRAM for 
the 20% perturbation and about 90 for targeted data swapping and 130 for 
targeted PRAM for the 10% perturbation. 

Another important consideration when assessing disclosure risk for releas- 
ing microdata is the individual per-record (record-level) disclosure risk mea- 
sures in (2.4). Individual records with high disclosure risk might be subjected 

Table 2 

Cross-classification of sample uniques according to per-record risk measures in (2.4) and 
estimates based on Poisson log-linear model under 20% random data swap 



Estimates from Poisson log-linear model 



Per-record risk measures from (2.4) 


0.00 0.09 


0.10-0.49 


0.50 1.00 


Total 


0.00-0.09 


1961 


133 


4 


2098 


0.10-0.49 


180 


325 


76 


581 


0.50-1.00 


8 


69 


75 


152 


Total 


2149 


527 


155 


2831 
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Fig. 1. Scatterplot of imdividual per-record risk measures in (2.4) against estimated risk 
measures based on the Poisson log-linear model under 20% random data swap. 



to further tailored perturbation. In Figure 1, we plot the per-record (record- 
level) risk measures in (2.4) for the sample uniques against the estimated 
adjusted risk measures (as described in Section 3) based on the Poisson log- 
linear model for the experiment based on 20% random data swapping. In 
addition, we summarize this bivariate distribution for the sample uniques 
in a two-way table in Table 2. In both of the analyses we see a good fit 
between the risk measures in (2.4) and their estimated risk measures. The 
Spearman's rank correlation is 0.91. 

4.4. Results of information loss assessment. The utility of microdata 
that has undergone data masking techniques is measured here in terms of 
the closeness of the results of an analysis based upon the perturbed data 
compared to the same analysis based upon the original data. The nature of 
the results and the type of analysis depend on user requirements. In gen- 
eral, microdata is multi-purpose and used by many different users. For this 
assessment we use the following three information loss measures reflecting 
distortions of distributions in two-way tables, as considered by Gomatam 
and Karr (2003) and Shlomo and Young (2006): 

• Relative absolute average distance per cell: Let D represent a frequency 
distribution for a two-way table produced from the microdata and let 
D{r,c) be the frequency in the cell in row r and column c. The distance 
metric is 

RAAD{Do„g,Dpert) = 100 X {Davg " AAD)/Davg, 

where the average cell size is defined as 

Davg = Dorig jr, c) / EC 
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with R the number of rows and C the number of columns m the table, 
and the AAD metric is defined as 

AAD{Dorig,Dpert) = ^ \Dpert{r, c) - D orig{r, c)\/ RC 
r,c 

with pert and orig referring to the perturbed and original tables respec- 
tively. The RAAD provides a measure of the average absolute perturba- 
tion per cell compared to the average cell size of the table. 

• Impact on measures of association: 

RCV{Do„g,Dp,ri) = 100 X {CV{Dpert) " CV {D o„g)) / CV {D a„g) , 

where 

CV{D) = ^xVmin(i?- 1,C- 1) 

is Cramer's measure of association, defined in terms of x^i the usual Pear- 
son chi-squared statistic for testing independence in the two-way table. 
The RCV provides a measure of attenuation of the association. 

• Impact on an ANOVA analysis: another form of bivariate analysis con- 
sists of comparing proportions in a category of a column (outcome) vari- 
able between categories of a row (explanatory) variable. Let P^{r) = 
D{r,c)/ '^^D{r,c) be the proportion in column c for row r and define 
the between-row variance of this proportion by 

BV{P^) = Y,{nr)-n'/{R-l), 

r 

where = X^r-^('^''')/Z]rc-^(^''')- '^^e measure of information loss is 

BVR{Pl„^,P;,,,) = 100 X (5F(P,^,rf) - 5I/(P„%))/5F(P,%). 

The BVR provides a measure of attenuation of between group differences 
in an ANOVA analysis. 

Table 3 presents results of the information loss measures on the misclas- 
sified samples used in Table 1. We obtain similar results for the information 
loss measures when comparing data swapping and PRAM with an expected 
improvement under the smaller perturbation rate of 10%. The targeted per- 
turbation shows slight improvements to the RAAD compared to the random 
perturbation under both perturbation rates. The targeted perturbation is 
generally worse for the RCV and BVR compared to the random pertur- 
bation under the 10% perturbation rate, but there are slight improvements 
under the 20% perturbation rate. The impact on the BVR for other ethnic 
groups (not shown) was mixed with most of the ethnic groups following the 
same pattern of attenuation as seen for the "Indian" ethnic group. There 
were a few exceptions due to small sample sizes. For example, we obtained a 
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Table 3 

Information loss measures for microdata samples generated from UK 2001 Census 
subject to three perturbative SDL methods 



SDL method 
Random Targeted 
Information loss measures Swap PRAM Swap PRAM 



10% perturbation 



RAAD on LAD x ethnicity 


98, 


.5 


98.1 


97.4 


97.2 


RAAD on LAD x economic activity 


97, 


.0 


96.9 


96.1 


95.8 


RCV on LAD x ethnicity 


-9, 


.9 


-10.4 


-13.3 


-12.9 


RCV on LAD x economic activity - 


■10, 


.8 


-9.8 


-11.0 


-10.4 


BVR on prop. "White British" across LAD - 


-20, 


.9 


-23.8 








BVR on prop. "Indian" across LAD - 


12, 


.6 


-13.0 


-18.9 


-17.3 


20% perturbation 












RAAD on LAD x ethnicity 


97, 


.4 


97.2 


96.5 


96.4 


RAAD on LAD x economic activity 


95, 


.8 


95.5 


95.0 


94.9 


RCV on LAD x ethnicity 


■20, 


.4 


-20.4 


-17.8 


-16.9 


RCV on LAD x economic activity - 


■18, 


.1 


-17.0 


-16.2 


-14.4 


BVR on proportion "White British" across LAD - 


37, 


.4 


-39.6 








BVR on proportion "Indian" across LAD - 


37, 


.5 


-39.1 


-34.2 


-29.5 



positive value for the BVR of "Chinese" ethnicity. Overall, the considerable 
reduction in disclosure risk achieved by the 20% targeted data swapping 
in Table 1 does not appear to be offset by any major reduction in utility 
compared to the other methods. 

In Figure 2 we plot a risk-utility map [Duncan, Keller-McNulty and Stokes 
(2001)]. The points on the map represent different candidate releases, that 
is, perturbation methods with different levels of perturbation. In addition to 
the levels considered earlier, we also include 2% and 5% targeted and random 
perturbation. The points are denoted T for targeted or R for random; 20 for 
20%, 10 for 10%, 5 for 5% or 2 for 2%; and S for swapping or P for PRAM. 
The points are plotted against the risk measure r in (2.8) on the Y-axis 
and the information loss measure RAAD for LAD x ethnicity on the X- 
axis. We see that, at the same level of information loss between the targeted 
10% perturbation and the random 20% perturbation with respect to the 
RAAD, we obtain lower disclosure risk with the targeted 10% perturbation. 
The same applies to the targeted 5% perturbation and the random 10% 
perturbation, with the targeted 5% perturbation having less disclosure risk 
than the random 10% perturbation at the same level of information loss. We 
draw a line to connect points on the risk- utility frontier [Gomatam, Karr and 
Sanil (2005)] and note that in all cases, at given levels of information loss, 
the targeted data swapping provides the lowest disclosure risk compared 
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to the other methods, although there is httle difference between targeted 
swapping and targeted PRAM. Targeting did not appear to lead to much 
greater information loss for the other measures in Table 3 and the general 
conclusion here is that targeting seems useful, enabling less perturbation to 
be applied and hence less information loss for a given level of risk protection. 
Of course, this finding could vary in other settings and an agency could 
use a similar risk-utility approach, based on its own data, to determine its 
preferred SDL approach. 

5. Impact of misclassifying multiple key variables. The previous section 
only provided estimates of the impact of misclassifying one key variable. 
In this section we provide a further numerical illustration to demonstrate 
the potential impact of misclassifying multiple key variables. We consider 
a simple setup where the C key variables , . . . ,X'~" are independent and 
binary. Their values in the external information and the microdata are de- 
noted and X^ respectively, c = 1, . . . , C. We suppose that Pr(X'^ = 2) = p, 
Pr(X'= = 1) = 1 - Pr(X^ = 2|X^ = 1) = 6*1 and Pr(X'= = llX'^ = 2) = 62 for 
c = 1, . . . , C. The misclassification probabilities Mjk in (2.1) will thus consist 
of products of C terms, each term being one of , 1 — , ^2 or 1 — ^2 • To force 
X and X to share the same marginal distribution, we set O2 = (1 —p)0i/p 
so that Fv{X'^ = 1) = p and, to simplify, write 61 = 9. 

In our experiment we generated values of X for a population of size N, 
drew a sample of size n by simple random sampling and then generated 
the values X. Various choices of {N,n,C,p,6) were considered. We also 
generated X for all population units so that Fj could be computed. 

We report values of risk measure (2.5) summed over sample uniques 
^jj/^jj ™ Figure 3 for iV = 100,000, n = 2000, p = 0.2 and for various 
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Fig. 3. Risk measure for different numbers of key variables and rates of misclassification. 

choices of C and 9. Note that the number of sample uniques increases as 
we add in more binary key variables. For C = 11 we have about 240 sample 
uniques and for C = 30 we have about 1960 sample uniques. In the absence 
of misclassification, we find that the risk increases monotonically and rapidly 
with C. This is because the number of population uniques is increasing with 
C and the fact that any observed match with a population unique must be a 
true match. On the other hand, in the presence of misclassification, we find 
that the risk does not increase monotonically, rather it reaches a maximum 
and then declines. As expected, the more misclassification, the lower the 
disclosure risk. 

We do not present information loss measures for the simulation since their 
values follow theoretically. For any analysis involving a given set of variables, 
say, the estimation of a table cross-classifying two particular key variables, 
the addition of further key variables will have no systematic impact on any 
of the information loss measures, since each of the variables of interest will 
be perturbed in the same way, irrespective of the inclusion of other key 
variables. The only variation we might expect to observe would be as a 
result of simulation variation. Any information loss function in Figure 3 
should therefore be flat. 

6. Discussion. In this paper we have shown how existing methods for 
assessing identification risk in survey microdata may be extended in a rela- 
tively simple way to capture the impact of SDL methods based on misclassifi- 
cation. We presented a general expression for the risk under misclassification 
in (2.4) and showed that the simple formula in (2.5) provided a good ap- 
proximation to this expression in two experiments based upon UK census 
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data. The advantage of the formula in (2.5) is that it enables the extension 
of existing risk assessment methods for unpeturbed data based on Poisson 
log-linear models, as discussed in Skinner and Shlomo (2008), to handle 
perturbative SDL methods. We demonstrated this extended approach also 
with the census data and provided a disclosure risk-data utility analysis. We 
showed how a targeted SDL method could dominate corresponding random 
methods. 

One challenge faced by agencies when assessing identification risk is the 
need to make assumptions about the information available to the intruder, 
specifically the nature and number of key variables. We conducted a nu- 
merical experiment to assess the sensitivity of the identification risk to the 
misclassification of different numbers of key variables. In the absence of mis- 
classification, the risk can increase rapidly with the number of key variables. 
We observed that misclassification can, however, dominate this effect with 
the risk eventually declining as the number of key variables increases. This 
is potentially an encouraging finding for agencies, since the sensitivity of the 
identification risk to departures from assumptions about the choice of key 
variables may be reduced in some settings when the kinds of SDL methods 
considered here are used and, in cases such as in Figure 3, there may even 
be a natural upper bound for the risk across plausible choices. 

Another issue faced by agencies is whether to release values of the pa- 
rameters of the SDL method employed, for example, the swapping rate. The 
information loss measures used in Section 4.4 assume that users of the mi- 
crodata simply ignore the perturbation in their analyses of the data. The 
agency's aim is to find an SDL method for which both the information loss 
and the disclosure risk are considered satisfactorily small. If this is not fea- 
sible, then it may be necessary for the agency to resort to an SDL method 
which leads to nonnegligible distortion of analyses. In this case it may be 
desirable for data analysts to be provided with values of the parameters of 
the SDL method to enable them to undertake valid inference, as discussed, 
for example, in Gouweleeuw et al. (1998) for PRAM (note that our use 
of invariant PRAM was designed to avoid this need). The disclosure risk 
implications of releasing such SDL parameters will not be pursued further 
here. 

The findings of this paper are not only relevant to understanding the 
impact of SDL methods, but also to the assessment of risk, before the ap- 
plication of SDL methods, in a way which more realistically takes account 
of the errors of classification which arise in survey data from measurement, 
coding and processing as well as from imputation for missing data, providing 
the agency has estimates for the diagonal elements of the misclassification 
matrix. 
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APPENDIX: DERIVATION OF (2.5) UNDER PROBABILISTIC 
RECORD LINKAGE 

Suppose, as before, that a microdata record i is linked to a target unit 
B by comparing the values of Xi and Xb- Following the approach of Fel- 
legi and Sunter (1969), let 'y{Xi,XB) = j if Xi = Xb = j, j = 1, . . . ,K, and 
'j^Xi^Xb) = K + 1 \i Xi ^ Xb and suppose that exact matching is used, 
so that a link is made if ^{Xi,XB) < K. Suppose the intruder draws the 
pair (i,-B) at random (with equal probability) from the set of pairs s x s*, 
where s* is the subset of units in U appearing in the external database 
from which the intruder selects B. Partition s x s* as M = {{i,B)\i = i?} 
and U = {{i,B)\ii^B} and let m(j) = Pr[7(Xi, X^) = j|(i,5) gM], u{j) = 
Pr[7(Xi,Xij) =i|(z,5) G U] and p = Pr[(z,5) G M], where Pr(-) is defined 
with respect to the selection of {i,B), the selection of the sample s and the 
misclassification mechanism. Then the identification risk for a linked pair 
{i,B) for which Xi = Xb = j is given by 

cP, = Fr[{^,B) G M|(X„Xb) = j] = "^^^ ^. 

m[j)p + u{j){l -p) 

A large sample size approximation gives m{j) ~ Mjjfj/n* , u{j) ~ {irFjfj — 
TTMjjfj)/{nn* — irn*), p = vr/n, where fj is the number of units b in s* for 
which Xf) = j and n* is the size of s* . It follows that (pj ~ Mjj/Fj irrespec- 
tive of the manner in which s* is selected from U. Skinner (2008) provides 
further discussion of identification risk under probabilistic record linkage. 
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